Abstract

We introduce a complexity measure for symbolic sequences. Starting from a segmentation
procedure of the sequence, we define its complexity as the entropy of the distribution of lengths of
the domains of relatively uniform composition into which the sequence is decomposed. We show
that this quantity satisfies the properties usually required of a "good" complexity measure. In
particular, it satisfies the one hump property, is super-additive, and has the important property of
depending on the level of detail at which the sequence is analyzed. Finally, we apply it to the
evaluation of the complexity profile of some genetic sequences.
I. INTRODUCTION

In the last few years the term complexity has become frequent in the scientific literature.
This has led to the introduction of diverse complexity measures in different areas of
science. Kolmogorov's algorithmic complexity, Lempel & Ziv's measure, Bennett's logical depth,
thermodynamic depth, physical complexity [7] and López-Ruiz, Mancini & Calbet's
complexity are some of the examples that have attracted the most attention. In fact,
this list does not reflect all the proposed complexity measures.

In spite of these efforts, and reflecting such diversity, consensus has yet to be reached about a
precise definition of the complexity concept that would allow its quantification. It is possible
that one of the main obstacles to reaching that consensus is the lack of a language
common to all the different areas of science in which the concept is meant to be introduced.
As an example, the notion of information and its quantifier, the entropy, is usually present in
measures proposed to evaluate the complexity of a system or of a process. At the same time,
entropy in physics is a measure of the disorder of the system, which grows as the disorder
grows. However, intuitively, a complex system may simultaneously involve order as well as
disorder. Two extreme cases are to be considered when, in physics, a complexity measure
is sought: on the one hand, a perfect crystal (a completely ordered system) and, on the other hand,
the ideal gas (a completely disordered system). Clearly both systems have no complexity
(or an extremely low complexity). In general, a properly defined complexity measure should
reach its maximum at some intermediate level between the order of the completely regular
and the disorder of the absolutely random. This desirable characteristic for all complexity
measures is known as the one hump property.

Very often, a complex system is described as one formed by many nonlinear elements
that interact with each other. These interactions give the system the capacity to
self-organize [10]. Given that complexity comes from the interactions of the single units,
these interactions must be taken into account when defining a measure that quantifies the
complexity of a system. When the different parts of a system, e.g., the molecules of an
ideal gas in equilibrium, do not interact, their behavior can be understood as the sum of its
separate components. But when interdependencies occur, this is no longer valid, and
to quantify the complexity we need a measure that takes those bonds into consideration.
An adequate complexity measure should be super-additive, meaning that the juxtaposition
of two systems results in a system whose complexity equals or exceeds the sum of the
complexities of the considered systems. This means that the (extensive) complexity of the whole
is equal to or larger than the sum of the (extensive) complexities of the parts. Here we
investigate a complexity measure for symbolic sequences. In this case, the super-additive
property reads as follows: if $C_{S_1}$ and $C_{S_2}$ denote the complexities of two symbolic sequences
$S_1$ and $S_2$, with corresponding lengths $L_1$ and $L_2$, then

$$(L_1 + L_2)\, C_{S_1 S_2} \geq L_1 C_{S_1} + L_2 C_{S_2}, \qquad (1)$$

where $C_{S_1 S_2}$ denotes the complexity of the juxtaposition of $S_1$ and $S_2$.
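As a small illustration, the check below assumes the length-weighted (extensive) reading of super-additivity, $(L_1 + L_2) C_{S_1 S_2} \geq L_1 C_{S_1} + L_2 C_{S_2}$. The function name `is_super_additive` and the complexity values in the usage lines are hypothetical placeholders, not results of this work; only the two subsequence lengths are those quoted in the caption of Fig. 2.

```python
def is_super_additive(c1, l1, c2, l2, c12, tol=1e-12):
    """Return True if (l1 + l2) * c12 >= l1 * c1 + l2 * c2, up to a tolerance."""
    whole = (l1 + l2) * c12      # extensive complexity of the juxtaposition
    parts = l1 * c1 + l2 * c2    # sum of the extensive complexities of the parts
    return whole + tol >= parts


# Placeholder complexity values, made up for illustration only.
print(is_super_additive(1.0, 57120, 1.2, 54288, 1.3))  # whole exceeds the weighted parts
print(is_super_additive(1.0, 57120, 1.2, 54288, 1.0))  # whole falls short of them
```

Dividing both sides by $L_1 + L_2$ gives the equivalent comparison between the complexity of the whole and the length-weighted average of the parts.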

The complexity measure we introduce in the present work takes into account the lengths
of the segments of relatively uniform content into which a symbolic sequence is divided. To
establish the segmentation we must look for compositionally homogeneous segments.
Two extreme cases may then occur after the segmentation process:

• all the resulting segments have the same length (periodic sequence), 

• the sequence has not been segmented (random sequence). 

These two cases correspond to the perfect crystal and the ideal gas mentioned earlier and,
as we will see, they have null complexity according to our definition. The next step
is to characterize what we will take as the most complex sequence, that is, we must fix a
third point on the complexity plot. To do so, we follow this line of
reasoning: when the probability of measuring a particular value of a certain quantity varies
inversely as a power of that value, the quantity is said to follow a power law. The
importance of power-law distributions in physics and related areas is underlined
by the ubiquity of such laws in a wide range of phenomena. Laws of this type
govern the frequency of use of words in any human language as much as the number of
moon craters of a particular size. It is generally accepted that a power-law dependence
is an indication of hierarchical organization. More interestingly, this kind of behavior also
appears in studies of brain dynamics. In fact, it is known that the brain constantly builds
complex functional networks corresponding to the traffic between regions. In this case it is found
that the probability for $k$ regions to be temporally correlated with a given region follows a
rule $k^{-\mu}$ with $\mu \approx 2$. To us, this example is highly significant because brain
dynamics is a paradigmatic case of self-organization and undoubtedly of what we can consider
a complex system. In turn, self-organization is seen as the mechanism shaping
a great number of systems in Nature.

According to these precedents, we consider it reasonable to take as a high-complexity
sequence one whose distribution of lengths of patches of relatively uniform composition
follows a power law, i.e., the probability $P(l)$ of finding a patch of relatively homogeneous
composition with length $l$ is given by:

$$P(l) \sim l^{-\mu}. \qquad (2)$$

We suppose further that the most complex sequence is the one in which the interdependence
between subsegments is maximum. To quantify that interdependence, we use the
autocorrelation function, $C(l)$ [13]. Interdependence is maximum when the autocorrelation
function is flat. There exists an interesting relationship between the exponent $\mu$ in (2) and
the behavior of the autocorrelation function. In fact, for a length distribution law given
by (2) it has been shown that the standard deviation in the symbol content of the sequence,
$F(l)$, behaves as

$$F(l) \sim l^{\alpha}$$

and the autocorrelation function follows a power law

$$C(l) \sim l^{-\gamma}$$

with $\gamma = 2 - 2\alpha$. An exponent $\mu < 2$ corresponds to an exponent $\alpha = 1$ and therefore
to $\gamma = 0$, that is, a flat autocorrelation function. Thus, for extremely long sequences, a
flat autocorrelation is associated with a distribution of segment lengths that complies with a
power law in which $\mu < 2$. It should be emphasized that every exponent $\mu < 2$ leads to a
flat autocorrelation function. However, the exponent $\mu = 1$ corresponds to a statistically
self-similar distribution of patches along the sequence. These facts suggest taking as the
most complex sequence the one whose distribution of lengths of patches of relatively uniform
composition is given by the law (2) with $\mu = 1$.

This work is organized as follows: In Section II we describe the sequence segmentation 
method implemented; in Section III we introduce a complexity measure and study its basic 
properties; in Section IV we apply the introduced measure to real genomic sequences; finally 
we present some conclusions. 



II. SEGMENTATION METHOD 



In this section we describe the segmentation algorithm applied to the study of the sequence
structure. The method is based on the Jensen-Shannon entropic divergence (JSD) and
has been successfully applied to the study of DNA sequences. DNA sequences are formed by
patches or domains of different nucleotide composition; given the huge spatial heterogeneity
of most genomes, the identification of compositional patches or domains in a sequence is a
critical step in understanding large-scale genome structure.

The JSD is a measure of distance between probability distributions. Although it was
initially defined as a distance between two probability distributions, Lin has proposed a
generalization to several probability distributions [12]. Let $P^{(k)} = \{p_i^{(k)},\ i = 1, \ldots, N\}$,
$k = 1, \ldots, M$, be a set of $M$ probability distributions ($\sum_i p_i^{(k)} = 1$, $k = 1, \ldots, M$)
for a discrete variable $X$ with possible values $x_i$; $p_i^{(k)}$ denotes the probability of
occurrence of the value $x_i$ according to the distribution $P^{(k)}$. The JSD for these
probability distributions is defined by:

$$JS[P^{(1)}, \ldots, P^{(M)}] = H\left[\sum_{k=1}^{M} \pi^{(k)} P^{(k)}\right] - \sum_{k=1}^{M} \pi^{(k)} H[P^{(k)}], \qquad (3)$$

where $H[P] = -\sum_j p_j \log_2 p_j$ is Shannon's entropy and the numbers $\pi^{(k)}$,
$k = 1, \ldots, M$, with $\sum_k \pi^{(k)} = 1$, are properly chosen weights.

The JSD is non-negative, bounded, and can be interpreted in the framework of information
theory. Incidentally, we mention that the JSD has been proposed as a complexity
measure for genomic sequences [16].

In the context of symbolic sequence analysis, the probabilities $p_i$ are approximated by
the frequency of occurrence of each symbol throughout the sequence. For a DNA sequence,
the symbols are the nucleotides {A, C, T, G}. If we want to compare the compositional
content of two symbolic sequences, say $S_1$ and $S_2$, of lengths $L_1$ and $L_2$, we can
use expression (3), where the weights are taken equal to $\pi^{(k)} = L_k / L$, $k = 1, 2$, with
$L = L_1 + L_2$. In this case the probability distributions $P^{(1)}$ and $P^{(2)}$ are approximated by
the frequency of occurrence of the different symbols throughout each sequence.
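The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names (`shannon`, `jsd`, `composition`, `sequence_jsd`) are hypothetical, base-2 logarithms are assumed, and the weights for the two-sequence case are the length fractions $L_k / L$ described in the text.

```python
from collections import Counter
from math import log2


def shannon(p):
    """Shannon entropy in bits of a probability vector p."""
    return -sum(x * log2(x) for x in p if x > 0)


def jsd(dists, weights):
    """Weighted Jensen-Shannon divergence of M distributions over the same N values."""
    n_values = len(dists[0])
    mixture = [sum(w * d[i] for d, w in zip(dists, weights)) for i in range(n_values)]
    return shannon(mixture) - sum(w * shannon(d) for d, w in zip(dists, weights))


def composition(seq, alphabet="ACGT"):
    """Symbol frequencies of seq, in the order given by alphabet."""
    counts = Counter(seq)
    return [counts[s] / len(seq) for s in alphabet]


def sequence_jsd(s1, s2, alphabet="ACGT"):
    """JSD between the compositions of two sequences, weighted by their lengths."""
    total = len(s1) + len(s2)
    return jsd([composition(s1, alphabet), composition(s2, alphabet)],
               [len(s1) / total, len(s2) / total])


print(sequence_jsd("AAAA", "CCCC"))  # 1.0: disjoint compositions, equal lengths
print(sequence_jsd("ACGT", "ACGT"))  # 0.0: identical compositions
```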

The segmentation procedure decomposes the sequence into domains or subsequences
whose base composition differs from that of the two adjacent subsequences
at a given level of statistical significance or threshold, $D_u$. This threshold is associated with
the level of detail at which the sequence is analyzed [22].



In order to make this paper self-contained, we describe the basic steps of the
segmentation procedure. For a more detailed description we refer the reader to reference [15].
Let us suppose that we define a moving cursor along the complete sequence. Each position
of the cursor yields two subsequences, one to the left and the other to the right of the
cursor. For each subsequence we can evaluate the frequency of occurrence of each symbol and
then calculate the JSD for each position of the cursor. The position that corresponds to a
maximum of the JSD above the chosen threshold, $D_u$, is taken as a cut point. Clearly these
points correspond to the maximum discrepancy between the compositional content
of the two subsequences. The procedure is repeated for each resulting subsequence until
no cursor position yields a JSD greater than the threshold value.
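The steps just described can be sketched as a short recursive routine. This is only an illustration under stated assumptions: the threshold is compared directly with the raw JSD in bits, which need not be the scale on which the threshold of this paper is expressed (see the cited reference for the actual significance criterion), and the names `entropy`, `best_cut`, `segment` and the `min_len` guard against very short segments are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(counts, total):
    """Shannon entropy in bits of the symbol frequencies counts[s] / total."""
    return -sum(c / total * log2(c / total) for c in counts.values() if c > 0)


def best_cut(seq, min_len=1):
    """Return (position, jsd) for the cursor position with the largest JSD."""
    n = len(seq)
    right = Counter(seq)          # symbol counts to the right of the cursor
    left = Counter()              # symbol counts to the left of the cursor
    h_whole = entropy(right, n)   # with weights L_k / L the mixture is the whole sequence
    best_pos, best_jsd = None, 0.0
    for i in range(1, n):
        left[seq[i - 1]] += 1     # move the cursor one symbol to the right
        right[seq[i - 1]] -= 1
        if i < min_len or n - i < min_len:
            continue
        jsd = h_whole - i / n * entropy(left, i) - (n - i) / n * entropy(right, n - i)
        if jsd > best_jsd:
            best_pos, best_jsd = i, jsd
    return best_pos, best_jsd


def segment(seq, threshold, min_len=1):
    """Cut seq recursively at maximum-JSD points; return the list of segment lengths."""
    pos, jsd = best_cut(seq, min_len)
    if pos is None or jsd <= threshold:
        return [len(seq)]         # no cut exceeds the threshold: keep as one domain
    return (segment(seq[:pos], threshold, min_len)
            + segment(seq[pos:], threshold, min_len))


print(segment("A" * 50 + "C" * 50, threshold=0.5))  # -> [50, 50]
```

Lowering `threshold` in this sketch produces more and shorter segments, which mirrors the role of the threshold as the level of detail of the analysis.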

When segmenting symbolic sequences with simple domain structures, homogeneous do- 
mains can be consistently found (if purely random fluctuations are excluded). However, 
when the method is applied to long-range correlated sequences, such homogeneity vanishes: 
by relaxing the threshold value, we find new domains within other domains, previously taken 
as homogeneous under a higher threshold value. This domains-within-domains phenomenon 
points to complex compositional heterogeneity in DNA sequences, which is consistent with
the hierarchical nature of biological complexity. We will come back to this point at the end
of the present work.

III. DEFINITION OF THE COMPLEXITY 

Let us consider a symbolic sequence $S$ of length $L$ (i.e., $L$ is the number of symbols in the
sequence). Let us assume that, by segmenting the sequence according to the procedure described
in the preceding section, we can decompose the sequence into $N_s$ patches or domains of different
compositional content (up to a significance level $D_u$). Let us denote by $l_i$, $i = 1, \ldots, N_s$,
the lengths of these segments. Obviously

$$\sum_{i=1}^{N_s} l_i = L. \qquad (4)$$

In general these lengths are not all different. Let us denote by $\mathcal{L}$ the subset of lengths
$l_{\alpha_i}$ such that $l_{\alpha_i} \neq l_{\alpha_j}$ if $i \neq j$:

$$\mathcal{L} = \{(l_{\alpha_1}, \ldots, l_{\alpha_K}),\ l_{\alpha_i} \neq l_{\alpha_j} \text{ if } i \neq j,\ K \leq N_s\}.$$

Let $N_{\alpha_i}$ be the number of segments of length $l_{\alpha_i}$. Then
$\sum_{i=1}^{K} N_{\alpha_i} = N_s$. Let us now consider a partition
$\Lambda = \{\Lambda_j\}_{j=1}^{\nu}$ of the interval $[1, L]$, with $\nu - 1$ (the number of
subintervals), in principle, arbitrary:

$$1 = \Lambda_1 < \Lambda_2 < \ldots < \Lambda_{\nu-1} < \Lambda_\nu = L. \qquad (5)$$

We call the quantity $\Delta_j = \Lambda_j - \Lambda_{j-1}$, $j = 2, \ldots, \nu$, the amplitude of the
corresponding subinterval.

Let us denote by $N_j$ the number of patches in the segmented sequence with length belonging
to the interval $[\Lambda_{j-1}, \Lambda_j)$. The condition $\sum_{j=2}^{\nu} N_j = N_s$ is satisfied.
Finally, let us denote by $f_j$ the occurrence frequency of segments whose length belongs to the
interval $[\Lambda_{j-1}, \Lambda_j)$ (with the convention that the interval corresponding to
$j = \nu$ includes the extreme value $L$):

$$f_j = \frac{N_j}{N_s}, \qquad \sum_{j=2}^{\nu} f_j = 1. \qquad (6)$$

From the knowledge of the frequencies $F = \{f_j\}$ we can evaluate Shannon's entropy

$$H_S(F; \Lambda, D_u) \equiv H[F] = -\sum_{j=2}^{\nu} f_j \log_2 f_j. \qquad (7)$$

Clearly this quantity depends on the partition $\Lambda$ and on the significance level at which
the segmentation was done, that is, it depends on the level of detail at which the sequence
was analyzed. Therefore we have explicitly included the partition $\Lambda$ and the significance
value $D_u$ as arguments of $H_S$.
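A minimal sketch of this entropy, given a list of segment lengths and the boundaries of a partition, could look as follows. The function name `length_entropy` and the example boundaries are hypothetical; the bins are half-open intervals between consecutive boundaries, with the last bin also including its right end, as in the convention stated above.

```python
from bisect import bisect_right
from math import log2


def length_entropy(lengths, bounds):
    """Entropy in bits of segment lengths binned by [bounds[j-1], bounds[j])."""
    n_bins = len(bounds) - 1
    counts = [0] * n_bins
    for length in lengths:
        # Index of the bin containing this length; the last bin includes bounds[-1].
        j = min(bisect_right(bounds, length) - 1, n_bins - 1)
        counts[j] += 1
    total = len(lengths)
    return -sum(c / total * log2(c / total) for c in counts if c)


bounds = [1, 2, 5, 13, 100]                    # an arbitrary example partition of [1, 100]
print(length_entropy([10] * 10, bounds))       # all segments in one bin: zero entropy
print(length_entropy([100], bounds))           # unsegmented sequence: zero entropy
print(length_entropy([1, 3, 7, 50], bounds))   # one segment per bin: 2.0 bits
```

The first two calls correspond to a sequence cut into equal-length segments and to a sequence that is not segmented at all; in both cases every length falls in a single bin and the entropy vanishes.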

There are two cases in which the entropy (7) does not depend on the particular partition
chosen:

1. an idealized periodic sequence and

2. an idealized random sequence.

Here, idealized means that the respective character is detected at every significant
level of detail of the analysis. In the first case, there exists only one value (the period)
for the length of the segments. Therefore $f_J = 1$ for some value $2 \leq J \leq \nu$ and $f_j = 0$ for
all other $j$. Thus, for a periodic sequence $H_S = 0$ for any partition of the interval $[1, L]$.
Analogously, because a random sequence is not segmented at any significant
level of detail (by the proper meaning of significant), only one of the $f_j$ is different from zero:
$f_\nu = 1$. Thus we also have $H_S = 0$ in this case. These two extreme cases correspond
to the crystal and the isolated ideal gas in the physical context. It should be emphasized that
$H_S$ carries information about the segmentation of the sequence. The fact that $H_S$ vanishes for
a periodic and for a random sequence makes $H_S(F; \Lambda, D_u)$ a good candidate for a
complexity measure. However, to be a true characteristic of the sequence
under study, a complexity measure must be independent of any arbitrary parameter. To this end,
we refine the measure by adopting a particular partition.

Now we proceed to characterize, in a formal way, what we will take as the most complex
sequence. Let us assume that after the segmentation procedure, at a given level of detail,
the sequence $S$ is decomposed into $N_s$ segments of uniform compositional content, and let us
suppose that we are able to identify a power law for the distribution of the segment lengths:

$$P(l) = \frac{1}{A(\Lambda^*)}\, l^{-\mu}, \qquad 1 \leq l \leq \Lambda^*, \qquad (8)$$

where $A(\Lambda^*) = \sum_{l=1}^{\Lambda^*} l^{-\mu}$, $\Lambda^*$ is a cutoff length and $\mu \geq 1$.
As we indicated in the introduction, and for the reasons expressed there, we choose $\mu = 1$.
The cutoff $\Lambda^*$ has to do with the finite size of the sequence $S$. Its value can be
deduced from the condition

^'-^-^ 

From the distribution law (8), and for a given partition, we can evaluate the frequencies

$$f_j = \frac{1}{A(\Lambda^*)} \sum_{l \in [\Lambda_{j-1},\, \Lambda_j - 1]} l^{-\mu} \qquad (10)$$

and from these, the entropy (7).

At this point we look for the partition $\Lambda$ that makes the entropy (7) reach its maximum
value when the frequencies (10) are inserted. Due to a fundamental property of the entropy,
the maximum value of $H_S(F; \Lambda, D_u)$ is reached for a partition $\Lambda$ such that the
frequencies $f_j$ are equal for all $j$, that is, the number of segments belonging to the interval
$[\Lambda_{j-1}, \Lambda_j)$ is the same for all $j$. Due to the cutoff, there exists a value $j^*$ such
that $f_j = 0$ for $j > j^*$. Hence, the maximum of the entropy corresponds to the biggest $j^*$
consistent with the uniformity condition for the $f_j$. The entropy $H_S(F; \Lambda, D_u)$ will be,
in this case, $\log_2 j^*$.

To satisfy the above two conditions, that is, the uniformity of $f_j$ for $j \leq j^*$ and the
biggest value for $j^*$, we must find a partition $\Lambda$ of the interval $[1, L]$ such that the number
of segments in each interval is constant and equal to one. These requirements can be expressed
as a set of equations to be satisfied by the end points of each of the intervals of the
partition $\Lambda$:

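$$
\begin{aligned}
1 + \frac{1}{2^{\mu}} + \cdots + \frac{1}{(\Lambda_2 - 1)^{\mu}} &= \frac{1}{\Lambda_2^{\mu}} + \cdots + \frac{1}{(\Lambda_3 - 1)^{\mu}} \\
\frac{1}{\Lambda_2^{\mu}} + \cdots + \frac{1}{(\Lambda_3 - 1)^{\mu}} &= \frac{1}{\Lambda_3^{\mu}} + \cdots + \frac{1}{(\Lambda_4 - 1)^{\mu}} \\
&\;\;\vdots \\
\frac{1}{\Lambda_{j^*-2}^{\mu}} + \cdots + \frac{1}{(\Lambda_{j^*-1} - 1)^{\mu}} &= \frac{1}{\Lambda_{j^*-1}^{\mu}} + \cdots + \frac{1}{(\Lambda^*)^{\mu}}
\end{aligned}
\qquad (11)
$$

with $\mu = 1$.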

As we are looking for the maximum $j^*$, it is clear from the previous set of equations
that we must take $\Lambda_2 = 2$. The rest of the amplitudes $\Delta_j = \Lambda_j - \Lambda_{j-1}$ can be
obtained from the set of equations (11).
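One way to obtain such a partition numerically is sketched below. It is an assumption-laden illustration rather than the procedure used in this work: since sums of $l^{-\mu}$ over integer ranges cannot in general be made exactly equal, the sketch closes each interval at the boundary that brings its sum closest to that of the first interval $[1, 1]$, which equals 1. The function name `harmonic_partition` is hypothetical, and the last interval is simply closed at $L$.

```python
def harmonic_partition(L, mu=1.0):
    """Boundaries 1 = b[0] < b[1] < ... < b[-1] = L with nearly equal sums of l**(-mu)."""
    if L < 2:
        raise ValueError("L must be at least 2")
    bounds = [1, 2]               # first interval is [1, 1], whose sum is 1
    target = 1.0
    start = 2
    while True:
        total, l = 0.0, start
        while l <= L:
            term = l ** (-mu)
            # Stop as soon as one more length would move the sum away from the target.
            if total > 0 and abs(total + term - target) > abs(total - target):
                break
            total += term
            l += 1
        if l > L:                 # ran out of lengths before completing this interval
            break
        bounds.append(l)          # the next interval starts at l
        start = l
    if bounds[-1] != L:
        bounds.append(L)          # close the partition at the sequence length
    return bounds


print(harmonic_partition(1000)[:4])  # -> [1, 2, 5, 13]
```

For $\mu = 1$ the interval sums are differences of harmonic numbers, so the boundaries grow roughly geometrically and the resulting partition resembles a logarithmic binning of the lengths.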

Now we are in a position to introduce our complexity measure for an arbitrary symbolic
sequence $S$ of length $L$. We define it as:

$$C_S = H[F_L], \qquad (12)$$

where $H[F_L]$ is the entropy of the distribution of lengths of the domains into which the sequence
has been decomposed, evaluated according to the partition of the interval $[1, L]$ given by the
relations (11) with $\mu = 1$.

The evaluation of the complexity (12) for an arbitrary sequence $S$ of length $L$ requires:

1. calculating the partition $\Lambda$ corresponding to the length $L$ according to (11) for $\mu = 1$;

2. using the segmentation procedure described in Section II, at a certain significance
value $D_u$, evaluating the set of segment lengths and, from it, the frequencies $f_j$ given by (6) for
the partition $\Lambda$;

3. finally, evaluating the entropy $H_S$ given by (7).

Incidentally, it is worth mentioning that for a greater value of $\mu$ compatible with the
flat autocorrelation condition ($\mu < 2$), the entropy $H[F_L]$, evaluated following the previously
described steps, takes extremely low values. Therefore, besides the conceptual motives that
led to the choice of $\mu = 1$, there are practical ones as well.

IV. APPLICATIONS AND RESULTS 

In this section we apply the proposed measure to the evaluation of the complexity of
some DNA sequences. In all examples the quaternary alphabet {A, T, C, G} is used. These
evaluations allow us, on the one hand, to study the main properties of the measure, such as its
dependence on the level of detail in the analysis of the sequence and the super-additivity
property; on the other hand, we can investigate our measure as a tool for unravelling
certain structural features of DNA, for instance the content of introns and exons
and its relation to evolutionary aspects of the genome.

As already stated, an appropriate complexity measure should take into account
the level of detail at which the system under study is analyzed. To check this dependence
we apply the measure (12) to real DNA sequences with different correlation structures and
to a computer-generated random sequence. Figure 1 shows the complexity $C_S$ as a function
of the threshold level, $D_u$, for the genomic sequences HUMTCRADCV and ECO110k and for
the random one (plots of this kind are known as complexity profiles). The first is a
human DNA sequence with long-range correlations. The second is an uncorrelated
bacterial sequence. A first remarkable aspect of $C_S$ is that there exists a range of the
significance value $D_u$, $20 < D_u < 50$, for which it vanishes when evaluated for
the random sequence. This random sequence was built with the same composition
as ECO110k. For $D_u$ in this interval, the values of the complexity for
the human sequence are greater than those for the bacterial one. This fact is consistent
with taking the interval indicated above as the range of interest for the threshold. One
noticeable characteristic of the complexity profiles of the natural sequences is that, unlike
those obtained with a previously introduced complexity measure, they do not go to zero as the
threshold increases.

Another aspect of $C_S$ that we investigated has to do with the super-additivity property, Eq. (1).
In Figure 2 we show the complexity profiles for the complete DNA sequences ECO110k and
the human beta-globin region HUMHBB, together with the weighted sum of the complexity
profiles for two arbitrary subsequences of each of these two sequences. Clearly Eq. (1) is verified.
It follows from the definition of $C_S$ that the complexity of any self-concatenation of an
arbitrary sequence is equal to the complexity of the original sequence whenever the fusion point
coincides with a cut point resulting from the segmentation procedure. If this is not the case,
the resulting value for the complexity of the concatenated sequence might be, for very long
sequences, slightly different from the complexity of the original sequence.

It is known that only a small portion of the genome of higher organisms encodes
information for the amino acid sequences of proteins [21]. The role of introns (continuous noncoding
regions in DNA) and intergenomic sequences (noncoding DNA fragments intertwined
between coding regions) remains unknown. The study of the statistical properties of the
noncoding regions has shown the existence of long-range correlations, which indicate the
presence of an underlying structural order in the intron and intergenomic segments. This
structural order is made apparent in the complexity profiles shown in Figure 3, where we
have plotted the complexity values for the coding and noncoding regions of human
chromosome 22.

Genomic sequences are a valuable source of information about the evolutionary history
of species. In particular, it has been possible to relate some statistical characteristics
observed among genomic sequences to the influence of a variety of ongoing processes, including
evolution. In this context we conclude this work by evaluating the complexity of
homologous DNA sequences of different species, in particular for the myosin heavy chain. In
general, it can be observed that there is a concordance between the biological complexity
of the species and the values of $C_S$. It should be emphasized that there exists a relationship
between the percentage of introns and the long-range correlations in the sequence. This fact
is clearly manifested by the complexity $C_S$, as can be observed in Figure 4.

FIG. 1: Complexity profiles of two natural sequences and a computer-generated random sequence.
In the latter case, the sequence has the same compositional content as ECO110k.

FIG. 2: Complexity profiles for the sequences ECO110k ($L_{eco} = 111408$ bp) and HUMHBB
($L_{hum} = 73308$ bp). The filled symbols correspond to the complexity of the whole sequences, and
the empty ones correspond to the (weighted) sum of the complexities of two arbitrary subsequences
of each sequence. The subsequences were taken in such a way that their juxtaposition equals
the complete sequence ($L_{e1} = 57120$ bp and $L_{e2} = 54288$ bp; $L_{h1} = 42720$ bp and
$L_{h2} = 30588$ bp).

FIG. 3: Differences in $C_S$ between coding and noncoding regions of the sequence corresponding to
human chromosome 22.

FIG. 4: Complexity profiles of myosin heavy-chain genes in different species (total length,
percentage of introns): Human (28438 bp, 74%), Rat (25759 bp, 77%), Chicken (31111 bp, 74%),
Drosophila (22663 bp, 66%), Brugia (11766 bp, 32%), Acanthamoeba (5894 bp, 10%),
Caenorhabditis (10780 bp, 14%), Yeast (6108 bp, 0%).